The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
bank-full.csv - all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
Available from: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Using the data collected from existing customers, build a model that will help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
1. age: Continuous feature
2. job: Type of job (management, technician, entrepreneur, blue-collar, etc.)
3. marital: marital status (married, single, divorced)
4. education: education level (primary, secondary, tertiary)
5. default: has credit in default?
6. housing: has housing loan?
7. loan: has personal loan?
8. balance: average yearly account balance
9. contact: contact communication type
10. month: last contact month of year
11. day: last contact day of the month
12. duration: last contact duration, in seconds*
13. campaign: number of contacts performed during this campaign and for this client
14. pdays: number of days that passed after the client was last contacted in a previous campaign (-1 indicates the client was not previously contacted, or the contact was more than 900 days ago)
15. previous: number of contacts performed before this campaign and for this client
16. poutcome: outcome of the previous marketing campaign
17. Target: has the client subscribed to a term deposit? (yes, no)
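The -1 sentinel in 'pdays' deserves care before modelling: treated as an ordinary number, it sits far below all real day counts. A minimal sketch of turning the sentinel into an explicit flag, using a hypothetical mini-sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical mini-sample illustrating the 'pdays' sentinel value of -1
sample = pd.DataFrame({"pdays": [-1, 5, 120, -1, 300]})

# Flag clients with no prior contact so -1 is not treated as a real day count
sample["not_previously_contacted"] = (sample["pdays"] == -1).astype(int)

print(sample["not_previously_contacted"].tolist())  # → [1, 0, 0, 1, 0]
```

Whether this flag helps is an empirical question; the models below use the raw column as-is.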
# limit warning messages
import warnings
warnings.filterwarnings('ignore')
# import numpy & pandas
from os import system
import numpy as np
import pandas as pd
# import plot libraries & enable plots in Jupyter notebook
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas_profiling
# import sklearn tools
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_curve, roc_auc_score, accuracy_score
from sklearn import tree
from IPython.display import Image
# import advanced metrics from yellowbrick
from yellowbrick.classifier import ClassificationReport, ROCAUC
# Remove scientific notation and display numbers with 3 decimal places instead
pd.options.display.float_format = '{:,.3f}'.format
# Increase cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
# Update default style and size of charts
plt.style.use('ggplot')
# Increase max number of rows and columns to display in pandas tables
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 500)
# import dataset from csv file
df = pd.read_csv('bank-full.csv')
# display first 10 rows of data
df.head(10)
# shape of dataset
df.shape
# data types and info
df.info()
# create numeric and object datasets
df_num = df.select_dtypes(include=['int64'])
df_obj = df.select_dtypes(include=['object'])
# description of attributes for numerical data types / statistical summary
df_num.describe().transpose()
# description of attributes for 'object' data types
df_obj.describe().transpose()
# summary of missing values (null) for entire dataset
df.isnull().sum()
# find number of 'unknown' values in 'object' features of dataset
for i, col in enumerate(df_obj.columns):
    print(col)
    print(df_obj[col].value_counts().loc[lambda x: x.index == 'unknown'])
    print()
# number of unique values in each column
df.nunique()
# visualize numerical data distributions with boxplots
for i, col in enumerate(df_num.columns):
    plt.figure(i, figsize=(20, 2))
    sns.boxplot(x=df_num[col])
    plt.show()
# visualize numerical data distributions with histograms
for i, col in enumerate(df_num.columns):
    plt.figure(i, figsize=(20, 2))
    sns.histplot(df_num[col], kde=False, bins=100)
    plt.show()
# pairplot numerical data
sns.pairplot(df_num)
# Visualize object data with countplots (counts in descending order)
df_obj = df.select_dtypes(include=['object'])
for i, col in enumerate(df_obj.columns):
    sort_descending = df_obj[col].value_counts().index  # value_counts already sorts descending
    plt.figure(i, figsize=(20, 4))
    sns.countplot(x=col, order=sort_descending, data=df_obj)
    plt.show()
1) The dataset contains a mixture of numeric and object datatypes.
2) The dataset does not appear to have missing or null values; however, it does contain a large number of 'unknown' values.
3) There are some outliers, invalid values, and special values that will need to be considered.
4) **Important note:** this dataset's website states that 'duration' should be discarded for a realistic predictive model.
1) We will convert the 'object' features to numeric via one-hot encoding.
2) Although some of the data is skewed, we will not adjust for this in our initial models.
3) We will not remove 'unknown' values, as these may be informative to the model.
4) We will perform a correlation analysis to help identify relevant features for the model.
5) We will extract relevant features into a new dataset where possible.
# convert columns with 'object' datatype to categorical variables
df1 = df.copy()
for feature in df1.columns:
    if df1[feature].dtype == 'object':
        df1[feature] = pd.Categorical(df1[feature])
# check dataset information again
df1.info()
# one-hot encode categorical features (drop first column to simplify model)
df1 = pd.get_dummies(df1, drop_first=True)
# rename 'Target_yes' column back to 'Target' for clarity
df1.rename(columns={"Target_yes": "Target"}, inplace=True)
# check dataset information again
df1.info()
df1.describe().transpose()
# Determine correlation of predictor variables and target column
cor = df1.corr()
cor_target = abs(cor["Target"])
# Print features that highly correlated with the target feature
print('Highly Correlated Features (i.e. > 0.5):')
print(cor_target[cor_target>0.5])
print()
# Print features that have some correlation with the target feature
print('Correlated Features (i.e. > 0):')
print(cor_target[cor_target > 0])
print()
# Compare number of features in original dataset with number of correlated features
df1_num_cols = len(df1.columns)
cor_target_num_cols = len(cor_target[cor_target > 0].index)
print(' Dataset Features = ', df1_num_cols)
print('Correlated Features = ', cor_target_num_cols)
1) No single feature appears to be strongly correlated with the target.
2) 'duration' and 'poutcome_success' have the highest correlation.
3) However, 'duration' should be removed according to this dataset's source website instructions.
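Observations like these are easier to read when the correlation series is ranked by absolute value. A short sketch, using hypothetical correlation numbers standing in for the real `cor_target` series:

```python
import pandas as pd

# Hypothetical correlation values standing in for cor_target (illustrative only)
cor_target = pd.Series({
    "duration": 0.394,
    "poutcome_success": 0.307,
    "housing_yes": -0.139,
    "age": 0.025,
})

# Rank features by absolute correlation with the target, strongest first
ranked = cor_target.abs().sort_values(ascending=False)
print(ranked.index[0])  # → duration
```

Applied to the real `cor_target`, the same two lines give a ranked shortlist for feature selection.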
X = df1.drop(["Target", "duration"], axis=1)
y = df1["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
# Fit the model on the training data
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
# Predict on the test data
y_predict = model.predict(X_test)
coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
# Score the model
model_var = model
model_name = 'Logistic Regression'
print('Training Data Score:', model_var.score(X_train, y_train))
print(' Test Data Score:', model_var.score(X_test, y_test))
y_predict = model_var.predict(X_test)
model_accuracy = model_var.score(X_test, y_test)
model_recall = recall_score(y_test,y_predict)
model_precision = precision_score(y_test,y_predict)
model_f1 = f1_score(y_test,y_predict)
model_roc_auc = roc_auc_score(y_test,y_predict)
model_scores = {'Method': [model_name],
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1,
'ROC AUC': model_roc_auc
}
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.DataFrame(model_scores)
resultsDf = resultsDf[['Method', 'Accuracy', 'Recall', 'Precision','F1', 'ROC AUC']]
resultsDf
# Display the Confusion Matrix
y_predict = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt='g')
# Visualize model performance with yellowbrick library
lg = LogisticRegression(solver = 'liblinear')
viz = ClassificationReport(lg)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(lg, macro=True, micro=False, per_class=False)
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show();
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
# Score the decision tree
print('Training Data Score:', dTree.score(X_train, y_train))
print(' Test Data Score:', dTree.score(X_test, y_test))
# Comment: The decision tree as-is is overfitted; we will regularize it below
# Visualize the tree (works only if the "dot" command is available on your machine)
train_char_label = ['No', 'Yes']
with open('decision_tree.dot', 'w') as Decision_Tree_File:
    tree.export_graphviz(dTree, out_file=Decision_Tree_File,
                         feature_names=list(X_train), class_names=train_char_label)
# Note: requires Graphviz to be installed so that the "dot" command is on the PATH
retCode = system("dot -Tpng decision_tree.dot -o decision_tree.png")
if retCode > 0:
    print("system command returned error: " + str(retCode))
else:
    display(Image("decision_tree.png"))
# Reducing Overfitting of model
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1)
dTreeR.fit(X_train, y_train)
# Comment: We get slightly better results with max_depth=5, but with greater model complexity.
# Score the model
model_var = dTreeR
model_name = 'Decision Tree'
print('Training Data Score:', model_var.score(X_train, y_train))
print(' Test Data Score:', model_var.score(X_test, y_test))
y_predict = model_var.predict(X_test)
model_accuracy = model_var.score(X_test, y_test)
model_recall = recall_score(y_test,y_predict)
model_precision = precision_score(y_test,y_predict)
model_f1 = f1_score(y_test,y_predict)
model_roc_auc = roc_auc_score(y_test,y_predict)
model_scores = {'Method': model_name,
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1,
'ROC AUC': model_roc_auc
}
# Record the score for later comparison
resultsDf.loc[1] = model_scores
resultsDf
# Visualize Reduced Tree Model
train_char_label = ['No', 'Yes']
with open('decision_treeR.dot', 'w') as Decision_Tree_FileR:
    tree.export_graphviz(dTreeR, out_file=Decision_Tree_FileR,
                         feature_names=list(X_train), class_names=train_char_label)
# Works only if the "dot" command is available on your machine
retCode = system("dot -Tpng decision_treeR.dot -o decision_treeR.png")
if retCode > 0:
    print("system command returned error: " + str(retCode))
else:
    display(Image("decision_treeR.png"))
# Display the Confusion Matrix
y_predict = dTreeR.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt='g')
viz = ClassificationReport(DecisionTreeClassifier(criterion = "gini", max_depth=3))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(DecisionTreeClassifier(criterion="gini", max_depth=3))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Calculate feature importance
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp"], index = X_train.columns))
# Build Random Forest Model
rfcl = RandomForestClassifier(n_estimators = 30)
rfcl = rfcl.fit(X_train, y_train)
# Score the model
model_var = rfcl
model_name = 'Random Forest'
print('Training Data Score:', model_var.score(X_train, y_train))
print(' Test Data Score:', model_var.score(X_test, y_test))
y_predict = model_var.predict(X_test)
model_accuracy = model_var.score(X_test, y_test)
model_recall = recall_score(y_test,y_predict)
model_precision = precision_score(y_test,y_predict)
model_f1 = f1_score(y_test,y_predict)
model_roc_auc = roc_auc_score(y_test,y_predict)
model_scores = {'Method': model_name,
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1,
'ROC AUC': model_roc_auc
}
# Record the score for later comparison
resultsDf.loc[2] = model_scores
resultsDf
# Display the Confusion Matrix
y_predict = rfcl.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt='g')
# Visualize model performance with yellowbrick library
viz = ClassificationReport(RandomForestClassifier(n_estimators = 30))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(RandomForestClassifier(n_estimators = 30))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Build Bagging Classifier Model
bgcl = BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=1)
bgcl = bgcl.fit(X_train, y_train)
# Score the model
model_var = bgcl
model_name = 'Bagging'
print('Training Data Score:', model_var.score(X_train, y_train))
print(' Test Data Score:', model_var.score(X_test, y_test))
y_predict = model_var.predict(X_test)
model_accuracy = model_var.score(X_test, y_test)
model_recall = recall_score(y_test,y_predict)
model_precision = precision_score(y_test,y_predict)
model_f1 = f1_score(y_test,y_predict)
model_roc_auc = roc_auc_score(y_test,y_predict)
model_scores = {'Method': model_name,
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1,
'ROC AUC': model_roc_auc
}
# Record the score for later comparison
resultsDf.loc[3] = model_scores
resultsDf
# Display the Confusion Matrix
y_predict = bgcl.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt='g')
# Visualize model performance with yellowbrick library
viz = ClassificationReport(BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=1))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=1))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Build Adaboost Classifier Model
abcl = AdaBoostClassifier(n_estimators = 100, learning_rate=0.1, random_state=1)
abcl = abcl.fit(X_train, y_train)
# Score the model
model_var = abcl
model_name = 'Adaboost'
print('Training Data Score:', model_var.score(X_train, y_train))
print(' Test Data Score:', model_var.score(X_test, y_test))
y_predict = model_var.predict(X_test)
model_accuracy = model_var.score(X_test, y_test)
model_recall = recall_score(y_test,y_predict)
model_precision = precision_score(y_test,y_predict)
model_f1 = f1_score(y_test,y_predict)
model_roc_auc = roc_auc_score(y_test,y_predict)
model_scores = {'Method': model_name,
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1,
'ROC AUC': model_roc_auc
}
# Record the score for later comparison
resultsDf.loc[4] = model_scores
resultsDf
# Display the Confusion Matrix
y_predict = abcl.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt='g')
# Visualize model performance with yellowbrick library
viz = ClassificationReport(AdaBoostClassifier(n_estimators= 100, learning_rate=0.1, random_state=1))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(AdaBoostClassifier(n_estimators= 100, learning_rate=0.1, random_state=1))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Build Gradient Boost Algorithm Model
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=1)
gbcl = gbcl.fit(X_train, y_train)
# Score the model
model_var = gbcl
model_name = 'Gradient Boost'
print('Training Data Score:', model_var.score(X_train, y_train))
print(' Test Data Score:', model_var.score(X_test, y_test))
y_predict = model_var.predict(X_test)
model_accuracy = model_var.score(X_test, y_test)
model_recall = recall_score(y_test,y_predict)
model_precision = precision_score(y_test,y_predict)
model_f1 = f1_score(y_test,y_predict)
model_roc_auc = roc_auc_score(y_test,y_predict)
model_scores = {'Method': model_name,
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1,
'ROC AUC': model_roc_auc
}
# Record the score for later comparison
resultsDf.loc[5] = model_scores
resultsDf
# Visualize model performance with yellowbrick library
viz = ClassificationReport(GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=1))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=1))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Print dataset of comparison scores for the different models
resultsDf
Here the bank wants to better target its marketing towards customers with the highest likelihood of taking a term deposit. False negatives are the costly errors, because each one represents missed revenue from a customer who would have subscribed; hence recall is the important metric.
In the case of false positives, the bank loses only a small amount of marketing cost.
In the case of false negatives, the bank misses revenue from term deposits that customers would have taken.
All of the models achieved high and similar accuracy; however, the Bagging and Random Forest algorithms achieved the highest recall scores.
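This reasoning can be made concrete with a small worked example. Using hypothetical confusion-matrix counts (not the notebook's actual results), accuracy can stay high even when most would-be subscribers are missed, which is exactly what recall exposes:

```python
# Worked example of why recall matters here, with hypothetical confusion-matrix counts
tn, fp, fn, tp = 11500, 300, 900, 700  # hypothetical test-set counts

recall = tp / (tp + fn)      # share of actual subscribers the model caught
precision = tp / (tp + fp)   # share of flagged clients who actually subscribed
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, precision={precision:.3f}")
# → accuracy=0.910, recall=0.438, precision=0.700
```

Despite 91% accuracy, this hypothetical model misses more than half of the real subscribers, so comparing models on recall is the right call for this business problem.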